This project is divided into two main sections.
The first section explores the evolution of tennis match durations from 1991 to 2023. This is the most challenging part of the project since we are working with raw, unfiltered real-world data. As a result, extensive preprocessing and exploratory analysis are required before applying time series models. The presence of missing values, inconsistencies, and potential outliers makes this phase particularly complex, as cleaning and structuring the dataset correctly is crucial for obtaining reliable results.
The second section is a simpler analysis of the search trends in Spain for the keywords “Playa” (Beach) and “Ofertas” (Deals). Unlike the first part, this dataset is expected to be cleaner and requires less preprocessing. The objective here is to observe and analyze seasonal patterns, trends, and possible correlations between these search terms, particularly in relation to factors such as weather conditions and holiday periods.
By comparing these two sections, we aim to demonstrate how different types of time series data require varying levels of preparation and modeling approaches, highlighting both the difficulties of working with real-world sports data and the insights that can be derived from structured online search trends.
In this study, we will conduct a detailed analysis of professional tennis matches from the ATP circuit played throughout the year, using time series analysis to identify trends and patterns over time. Our research will focus on the following key aspects:
By leveraging time series techniques, this analysis aims to provide valuable insights into match dynamics and ranking fluctuations, helping players, coaches, and analysts better understand long-term performance trends and their implications.
Before conducting the analyses, I would like to highlight some important insights and propose a few hypotheses that I believe will later be reflected in the results. By outlining these key ideas beforehand, we can better understand the expectations of the study and compare them with the actual findings. These hypotheses will help guide the interpretation of trends in match duration, player performance, and ranking evolution throughout the year.
In professional tennis, tournaments are classified into different levels based on their prestige, ranking points, and prize money. The main categories in the ATP circuit include Grand Slams, Masters 1000, ATP 500, and ATP 250 events. Each of these tournaments varies in terms of competitiveness, match format, and player participation.
A key distinction is that Grand Slam matches are played in a best-of-five sets format, whereas all other ATP tournaments follow a best-of-three sets structure. This fundamental difference significantly impacts match duration, as Grand Slam matches tend to be longer on average. The four Grand Slam tournaments are held at different times of the year:
Given this, we can establish our first hypothesis: in our time series analysis, we expect to observe peaks in match duration during the periods when Grand Slam tournaments are played. These peaks should align with the dates of the Australian Open, Roland Garros, Wimbledon, and the US Open, reflecting the increased length of matches during these events.
Example: Artificially generated graph to illustrate the hypothesis.
As shown in the generated graph, we expect to observe peaks in match duration during Grand Slam events.
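Since the artificial graph itself is not reproduced here, a minimal sketch of how such an illustration could be generated is shown below. All values (baseline duration, peak heights, and the approximate day-of-year of each Grand Slam) are purely synthetic assumptions for visualization, not real data.

```r
library(ggplot2)

# Purely synthetic illustration: a flat baseline duration with bumps at the
# approximate Grand Slam weeks (days of year are rough assumptions)
set.seed(42)
day <- 1:365
baseline <- 105 + rnorm(365, sd = 4)
slam_days <- c(AusOpen = 20, RolandGarros = 150, Wimbledon = 190, USOpen = 245)
peaks <- rowSums(sapply(slam_days, function(d) 40 * exp(-((day - d)^2) / 50)))
duration <- baseline + peaks

ggplot(data.frame(day, duration), aes(day, duration)) +
  geom_line(color = "steelblue") +
  labs(title = "Hypothesis: duration peaks during Grand Slam weeks (synthetic data)",
       x = "Day of Year", y = "Match Duration (Minutes)") +
  theme_minimal()
```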
Tennis has undergone a significant evolution over the past few decades, becoming a much more physical sport. In the 1990s, the focus of the game was more direct, with an emphasis on quick points and aggressive shots. However, over time, the playing style has shifted towards greater consistency, with players prioritizing long rallies from the baseline and physical endurance. This change has led to longer matches, as players are now able to maintain a high level of play for extended periods. As a result, match duration has increased considerably, particularly over the last 20 years, reflecting the enhanced physical and mental capabilities of modern players.
Example: Artificially generated graph to illustrate the hypothesis.
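As with the previous hypothesis, the illustrative figure can be recreated with synthetic values. The slope and noise level below are arbitrary assumptions chosen only to depict a gradual upward trend in yearly average duration.

```r
library(ggplot2)

# Purely synthetic illustration: gradually increasing yearly average duration
set.seed(7)
year <- 1991:2023
avg_minutes <- 95 + 0.8 * (year - 1991) + rnorm(length(year), sd = 3)

ggplot(data.frame(year, avg_minutes), aes(year, avg_minutes)) +
  geom_line(color = "darkred") +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(title = "Hypothesis: average match duration increases over time (synthetic data)",
       x = "Year", y = "Average Match Duration (Minutes)") +
  theme_minimal()
```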
This section covers essential data preprocessing steps, including handling missing values, normalization, and identifying outliers. We will then visualize trends and patterns to uncover insights from the data.
This first part will include the preprocessing of the data necessary for the later parts of the project.
Our dataset includes all ATP matches played from 1991 to 2023, allowing us to analyze the evolution of the sport over more than three decades. This will help us identify trends in match duration, player performance, and ranking changes. Our first step will be to merge the individual CSV files for each year to obtain our final dataframe.
Load the data
# Join all the CSV files in the data folder into a single dataframe
carpeta <- "C:/Users/jorge/OneDrive/Desktop/Apuntes Master Uc3m/Tercer semicuatrimestre/Time Series/Trabajo 1/Tenis data"
archivos <- list.files(path = carpeta, pattern = "\\.csv$", full.names = TRUE)
ATP_data <- bind_rows(lapply(archivos, read.csv))
head(ATP_data)
Our dataset contains a total of 49 variables. However, for the purposes of our analysis, we will focus only on the following eight variables:
These variables will allow us to examine match duration, tournament characteristics, and player performance over time.
Filtered data
# Select relevant columns
ATP_data_filtered <- dplyr::select(ATP_data, tourney_name, surface, tourney_date, winner_name, loser_name, score, best_of, minutes)
# Display the first rows of the filtered dataset
head(ATP_data_filtered)
Check for Missing Values
We first check for missing values so that we can later clean them by imputation or deletion.
colSums(is.na(ATP_data_filtered))
## tourney_name surface tourney_date winner_name loser_name score
## 0 0 0 0 0 0
## best_of minutes
## 0 13036
We have 13,036 matches where the duration in minutes has not been recorded. To handle these missing values, instead of removing the matches, we will replace each NA with the average duration of matches played in the same tournament during the same year. This approach preserves data quality by using the most relevant available information, ensuring that each replacement is contextually accurate within its own tournament and year.
# Convert tourney_date to a year column
ATP_data_filtered$year <- substr(ATP_data_filtered$tourney_date, 1, 4)
# Replace NAs in 'minutes' with the average duration for the same tournament and year
ATP_data_filtered <- ATP_data_filtered %>%
group_by(tourney_name, year) %>%
mutate(minutes = ifelse(is.na(minutes),
mean(minutes, na.rm = TRUE), # Replace NA with the mean of the same tournament and year
minutes)) %>%
ungroup()
# Check the number of missing values in each column
colSums(is.na(ATP_data_filtered))
## tourney_name surface tourney_date winner_name loser_name score
## 0 0 0 0 0 0
## best_of minutes year
## 0 9778 0
The remaining 9,778 NA values come from matches in the Davis Cup and other events that are not directly included in the ATP ranking system, where match duration was not recorded. To avoid adding potentially misleading information, we have decided to remove these events from the dataset. This ensures that our analysis focuses on ATP tour matches, where data is consistently available and relevant.
# Remove rows where 'minutes' is NA
ATP_data_clean <- ATP_data_filtered %>%
filter(!is.na(minutes))
# Verify that the NAs have been removed
head(ATP_data_clean)
In the next steps of our analysis, we will calculate outliers for the minutes column based on each tournament and year separately. This method ensures that the outlier detection takes into account the natural variations in match durations for different tournaments and years, rather than calculating outliers across the entire dataset. We will use the 1.5 * IQR rule to identify extreme values within each group (tournament and year). This approach will allow us to better understand and handle outliers, ensuring the reliability of the data used in subsequent analyses.
# Calculate outliers for 'minutes' within each 'tourney_name' and 'year'
ATP_data_clean <- ATP_data_clean %>%
group_by(tourney_name, year) %>%
mutate(
Q1 = quantile(minutes, 0.25, na.rm = TRUE),
Q3 = quantile(minutes, 0.75, na.rm = TRUE),
IQR = Q3 - Q1,
lower_bound = Q1 - 1.5 * IQR,
upper_bound = Q3 + 1.5 * IQR,
outlier = ifelse(minutes < lower_bound | minutes > upper_bound, TRUE, FALSE)
) %>%
ungroup()
# View the data with outliers marked
head(ATP_data_clean)
How many outliers?
We count the total number of outliers that are in the dataset.
# Count the number of outliers in the dataset
outlier_count <- ATP_data_clean %>%
filter(outlier == TRUE) %>%
count()
# View the result
outlier_count
We have a total of 1,640 outliers in our dataset.
The most extreme outliers
For this part, we are going to analyze the most extreme outliers to identify potential characteristics or specific moments that may have influenced match durations. This will allow us to detect external factors such as weather conditions, player fatigue, or tournament-specific rules that could have impacted the results. Additionally, by examining these extreme values, we can also verify whether any data points were recorded incorrectly in the dataset.
# Find the ten most extreme outliers (longest recorded matches)
extreme_outlier <- ATP_data_clean %>%
filter(outlier == TRUE) %>%
arrange(desc(minutes)) %>%
slice(1:10)
# View the most extreme outlier
extreme_outlier
The first two records in our outlier list are clear errors, as the match durations have been incorrectly recorded. However, the third record is not an error: it corresponds to the legendary Isner vs. Mahut match at Wimbledon 2010, which holds the record for the longest match in tennis history, lasting 11 hours and 5 minutes (665 minutes). Since our goal is to clean the dataset while maintaining data integrity, we will remove the first two erroneous outliers but keep the Isner vs. Mahut match, as it represents a real and significant event in tennis history.
# Read the image
image <- readJPEG("Foto wimbledon.jpg")
# Display the image using grid.raster
grid::grid.raster(image)
We will remove these two entries as they have been incorrectly recorded in the dataset.
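The removal step itself does not appear in the code above, although the later chunks operate on a dataframe named ATP_data_cleaned. A minimal sketch of that step, assuming (as the outlier table indicates) that the two erroneous entries are the two longest recorded durations, could be:

```r
library(dplyr)

# Drop the two erroneous records (the two longest recorded durations),
# while keeping the genuine Isner vs. Mahut match (665 minutes)
ATP_data_cleaned <- ATP_data_clean %>%
  arrange(desc(minutes)) %>%
  slice(-(1:2)) %>%
  arrange(tourney_date)
```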
In this section, we will adapt the date format of the tourney_date column, which is currently stored as a numeric value in the YYYYMMdd format (e.g., 20230101), to a standard Date format that can be easily understood and processed by R. By converting the date into a proper Date class, we will be able to efficiently use it as a time index for our time series analysis, ensuring compatibility with functions and visualizations in R.
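As a quick base-R illustration of this conversion (the date value below is an arbitrary example, not taken from the dataset):

```r
# A numeric YYYYMMDD value must pass through character before as.Date()
raw_date <- 20230101
as.Date(as.character(raw_date), format = "%Y%m%d")
#> [1] "2023-01-01"
```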
First step
In Grand Slam tournaments, unlike regular ATP events, matches are played over a two-week period instead of just one. This extended format means that the first week typically consists of early-round matches with a larger number of players, while the second week features later rounds with fewer but generally more competitive matches. Given this distinction, we will differentiate between matches played during the first week and those played during the second week to analyze potential differences in match duration and other characteristics.
# Define Grand Slam tournaments
grand_slam_tournaments <- c("Roland Garros", "Wimbledon", "Australian Open", "US Open")
# Modify the dataframe without filtering out non-Grand Slam matches
ATP_data_cleaned <- ATP_data_cleaned %>%
dplyr::group_by(tourney_name, tourney_date) %>%
dplyr::mutate(
row_number = dplyr::row_number(), # Number matches within each group
total_matches = dplyr::n(),
tourney_date = as.numeric(ifelse(
tourney_name %in% grand_slam_tournaments & row_number > total_matches / 2,
format(as.Date(as.character(tourney_date), format="%Y%m%d") + 7, "%Y%m%d"),
as.character(tourney_date)
))
) %>%
dplyr::ungroup() %>% # Ungroup before using select
dplyr::select(-row_number, -total_matches) # Remove auxiliary columns
Second step
We aim to calculate the average match duration for each tournament in our dataset. Since tennis matches are played on different days across various tournaments, we will group the data by tourney_date and then compute the mean duration of matches for each specific tournament. This approach will provide us with a clearer picture of how match durations vary over time and help in identifying trends or patterns in the data, which is essential for our time series analysis.
# Use dplyr::select explicitly
mean_duration <- ATP_data_cleaned %>%
group_by(tourney_date) %>%
summarise(mean_minutes = mean(minutes, na.rm = TRUE),
surface = first(surface)) %>%
dplyr::select(tourney_date, mean_minutes, surface) # Explicit select from dplyr
# Convert the 'tourney_date' to Date format
mean_duration$tourney_date <- as.Date(as.character(mean_duration$tourney_date), format = "%Y%m%d")
# View the result
head(mean_duration)
In the Data Visualization section, we will focus on exploring and visualizing the temporal patterns in match durations across different tournaments and years.
In this section, we will conduct a study of the seasonality of ATP tennis matches, focusing on how the duration of the matches varies throughout the year. To facilitate the comparison, we will plot the data for each year in a single graph, allowing us to visually assess and compare the seasonal variations in match durations over time.
# Convert 'tourney_date' to Date format
mean_duration$tourney_date <- as.Date(mean_duration$tourney_date)
# Ensure 'mean_minutes' is numeric
mean_duration$mean_minutes <- as.numeric(mean_duration$mean_minutes)
# Filter out rows where 'mean_minutes' is NA
mean_duration_clean <- mean_duration %>%
filter(!is.na(mean_minutes))
# Extract year and day of the year (day of year is a number from 1 to 365 or 366)
mean_duration_clean <- mean_duration_clean %>%
mutate(year = year(tourney_date), day_of_year = yday(tourney_date))
# Plot the time series for each year, overlaying them on the same graph
ggplot(mean_duration_clean, aes(x = day_of_year, y = mean_minutes, color = as.factor(year))) +
geom_line() + # Plot a line for each year
labs(title = "Seasonality of ATP Match Durations by Day of Year",
x = "Day of Year",
y = "Match Duration (Minutes)",
color = "Year") +
theme_minimal() +
scale_color_viridis_d() # Use a color scale to differentiate years
1991 vs 2023
Now, let’s compare the earliest values from 1991 with the most recent ones from 2023 (including 2010 and 2022 as intermediate reference years) to see if we can identify any clear differences. This analysis will help us understand how match durations have evolved over time and whether factors such as changes in playing styles, court conditions, or tournament regulations have had a significant impact on the length of matches.
# Filter for 1991, 2010, 2022 and 2023
mean_duration_filtered <- mean_duration_clean %>%
filter(year %in% c(1991, 2010, 2022, 2023))
# Plot the time series for the selected years
ggplot(mean_duration_filtered, aes(x = day_of_year, y = mean_minutes, color = as.factor(year))) +
geom_line() + # Plot a line for each year
labs(title = "Seasonality of ATP Match Durations: 1991 vs 2023",
x = "Day of Year",
y = "Match Duration (Minutes)",
color = "Year") +
theme_minimal() +
scale_color_viridis_d() # Use a color scale to differentiate years
Looking at a few representative values of match durations from 1991 to 2023, we can observe a significant increase in the average match duration over time. This rise could be linked to several factors, such as the evolution of playing styles, which have shifted from a more direct approach to one focused on consistency and physical endurance. Additionally, changes in court surfaces, as well as advancements in technology and player training, might also be contributing to longer match durations over the years. This pattern suggests a trend toward more intense matches, which could have implications for game strategy and player performance analysis.
We will differentiate the evolution of match duration by court type. By analyzing the data based on surface types such as hard, clay, and grass, we can examine how the nature of each surface may influence the average duration of matches over time. Different surfaces have distinct playing characteristics, which can affect the speed of the game, the players’ movements, and the overall length of the match. This analysis will allow us to explore whether certain surfaces tend to produce longer or shorter matches and how this has evolved throughout the years.
Hard courts
On hard courts, we expect to see seasonality in the periods when the Australian Open and the US Open are played. The Australian Open typically takes place in January (mid to late January), while the US Open is held in late August to early September.
# Hard courts
mean_duration_clean_surface <- mean_duration_clean %>%
filter(surface == "Hard") %>%
mutate(year = year(tourney_date), day_of_year = yday(tourney_date))
# Plot the time series for each year, overlaying them on the same graph
ggplot(mean_duration_clean_surface, aes(x = day_of_year, y = mean_minutes, color = as.factor(year))) +
geom_line() + # Plot a line for each year
labs(title = "Seasonality of ATP Match Durations by Day of Year (Hard Surface)",
x = "Day of Year",
y = "Match Duration (Minutes)",
color = "Year") +
theme_minimal() +
scale_color_viridis_d() # Use a color scale to differentiate years
Clay courts
On clay courts, we expect to see seasonality around the period when Roland Garros is played. The tournament typically takes place from late May to early June, making it the peak of the clay-court season.
# Clay courts
mean_duration_clean_surface <- mean_duration_clean %>%
filter(surface == "Clay") %>%
mutate(year = year(tourney_date), day_of_year = yday(tourney_date))
# Plot the time series for each year, overlaying them on the same graph
ggplot(mean_duration_clean_surface, aes(x = day_of_year, y = mean_minutes, color = as.factor(year))) +
geom_line() + # Plot a line for each year
labs(title = "Seasonality of ATP Match Durations by Day of Year (Clay Surface)",
x = "Day of Year",
y = "Match Duration (Minutes)",
color = "Year") +
theme_minimal() +
scale_color_viridis_d() # Use a color scale to differentiate years
Grass courts
For grass courts, we expect to see seasonality around Wimbledon, which is typically held from late June to early July. However, it is important to highlight that the grass-court season is much shorter compared to hard and clay courts. As a result, the temporal evolution may not be as clear as in the two previously analyzed surfaces. The limited number of grass tournaments and their concentration within a few weeks make it harder to observe long-term trends or strong seasonal patterns.
# Grass courts
mean_duration_clean_surface <- mean_duration_clean %>%
filter(surface == "Grass") %>%
mutate(year = year(tourney_date), day_of_year = yday(tourney_date))
# Plot the time series for each year, overlaying them on the same graph
ggplot(mean_duration_clean_surface, aes(x = day_of_year, y = mean_minutes, color = as.factor(year))) +
geom_line() + # Plot a line for each year
labs(title = "Seasonality of ATP Match Durations by Day of Year (Grass Surface)",
x = "Day of Year",
y = "Match Duration (Minutes)",
color = "Year") +
theme_minimal() +
scale_color_viridis_d() # Use a color scale to differentiate years
Conclusion
The analysis of the seasonality of ATP match durations shows a clear pattern centered around the dates of Grand Slam tournaments. These major events tend to exhibit significant differences in match duration, as they often feature the most intense and competitive matches. Additionally, by comparing data across different years, we can observe a general trend towards an increase in the average match duration. However, this analysis will be further detailed in subsequent steps to explore the specific factors contributing to this change over time.
In this analysis, we will focus on studying the long-term trend in the average duration of tennis matches. So far, we have observed indications that matches have become longer over the years, but we want to verify whether this hypothesis holds with a more detailed analysis. To do so, we will explore the evolution of average match duration over time, identifying possible patterns and confirming whether there has indeed been a significant increase in match length.
# Convert 'tourney_date' to Date format
mean_duration$tourney_date <- as.Date(mean_duration$tourney_date)
# Ensure 'mean_minutes' is numeric
mean_duration$mean_minutes <- as.numeric(mean_duration$mean_minutes)
# Filter out rows where 'mean_minutes' is NA
mean_duration_clean <- mean_duration %>%
filter(!is.na(mean_minutes))
# Create a tsibble object with 'tourney_date' as the index and 'year' as the key
mean_duration_tsibble <- mean_duration_clean %>%
mutate(year = year(tourney_date), month = month(tourney_date), day = day(tourney_date)) %>%
as_tsibble(index = tourney_date, key = year)
# Plot the time series for each year on the same graph to observe seasonality
ggplot(mean_duration_tsibble, aes(x = tourney_date, y = mean_minutes, color = as.factor(year))) +
geom_line() + # Create a line plot for each year
labs(title = "Seasonality of ATP Match Durations by Year",
x = "Date",
y = "Match Duration (Minutes)",
color = "Year") +
theme_minimal() +
scale_color_viridis_d() # Optionally, use a color scale for better differentiation
Conclusion on the Long-Term Trend
Our analysis of the long-term trend in match duration confirms our initial hypothesis: there is an increase in the average match length over the years. However, this growth is not as pronounced as we initially expected.
While the trend suggests a gradual rise, the increment is less marked than anticipated. This indicates that, although factors such as playing styles, player endurance, or rule changes may have influenced match durations, their impact has been more moderate than we originally theorized. Further analysis will help pinpoint the exact causes behind this evolution and its implications for the sport.
During the data visualization phase, we identified noticeable drops in match duration between 1990 and 1999. These dips show significantly lower average match times than expected, deviating from the overall trend observed in later years.
In this section, we will explore potential explanations for these anomalies. Possible factors may include changes in tournament formats, differences in playing conditions, or even data inconsistencies. By analyzing these variations in depth, we aim to determine whether these declines are due to external influences or if they reveal underlying patterns in match duration.
# Filter out rows where 'mean_minutes' is NA and keep only years between 1991 and 1999
mean_duration_clean <- mean_duration %>%
filter(!is.na(mean_minutes), year(tourney_date) >= 1991, year(tourney_date) <= 1999)
# Extract year and day of the year (day of year is a number from 1 to 365 or 366)
mean_duration_clean <- mean_duration_clean %>%
mutate(year = year(tourney_date), day_of_year = yday(tourney_date))
# Plot the time series for each year, overlaying them on the same graph
ggplot(mean_duration_clean, aes(x = day_of_year, y = mean_minutes, color = as.factor(year))) +
geom_line() + # Plot a line for each year
labs(title = "Seasonality of ATP Match Durations (1991-1999)",
x = "Day of Year",
y = "Match Duration (Minutes)",
color = "Year") +
theme_minimal() +
scale_color_viridis_d()
We will analyze tournaments where the average match duration was less than 60 minutes. Identifying these cases will help us understand whether specific conditions, such as rule changes, extreme weather, or other factors, led to significantly shorter matches. By examining these tournaments, we aim to gain insights into potential anomalies in match duration trends.
mean_duration <- mean_duration %>%
dplyr::mutate(tourney_date = lubridate::as_date(tourney_date))
ATP_data_filtered <- ATP_data_filtered %>%
dplyr::mutate(tourney_date = lubridate::as_date(as.character(tourney_date), format = "%Y%m%d"))
mean_duration_short_matches <- mean_duration %>%
dplyr::filter(mean_minutes < 60) %>%
dplyr::left_join(
ATP_data_filtered %>%
dplyr::select(tourney_date, tourney_name) %>%
dplyr::distinct(),
by = "tourney_date"
)
mean_duration_short_matches
Note: the NAs correspond to the tournaments shown above.
After analyzing the table results, we realize that the average match duration values below 60 minutes are actually errors within the dataset. These values do not accurately reflect the actual duration of the matches, so we will proceed to remove them. This will allow us to present a final time series that is more precise and representative of the average match duration in the analyzed tournaments.
We have now removed the erroneous values identified earlier, ensuring that only valid match durations remain in our dataset. By doing this, we have refined our data and can now present the final time series, which accurately reflects the evolution of match durations over time.
# Filter out incorrect matches with an average duration below 60 minutes and remove NA values
mean_duration_clean <- mean_duration %>%
filter(!is.na(mean_minutes) & mean_minutes >= 60)
# Create a tsibble object with 'tourney_date' as the index and 'year' as the key
mean_duration_tsibble <- mean_duration_clean %>%
mutate(year = year(tourney_date), month = month(tourney_date), day = day(tourney_date)) %>%
as_tsibble(index = tourney_date, key = year)
# Plot the time series for each year on the same graph to observe seasonality
ggplot(mean_duration_tsibble, aes(x = tourney_date, y = mean_minutes, color = as.factor(year))) +
geom_line() + # Create a line plot for each year
labs(title = "Seasonality of ATP Match Durations by Year",
x = "Date",
y = "Match Duration (Minutes)",
color = "Year") +
theme_minimal() +
scale_color_viridis_d()
We apply a logarithmic transformation to the average match duration variable to reduce variability and stabilize variance. Initially, this transformation would not be strictly necessary, as we do not observe a significant increase in error as values grow. However, we implement it as an additional check to verify this initial assumption and ensure that the model is not affected by potential heteroscedasticity that may not be evident in a preliminary analysis.
# Plot the time series using the specified structure
mean_duration_tsibble |>
autoplot(log(mean_minutes)) +
labs(y = "Log Match Duration (Minutes)",
title = "Seasonality of ATP Match Durations by Year")
As expected, applying the logarithmic transformation does not introduce any significant changes to the results. Since the error remains stable and the variability does not show any concerning patterns, we will proceed with the original time series without transformation.
Now that we have cleaned our dataset and constructed the final time series, we will proceed with a statistical analysis to better understand its characteristics. This section will cover various tests and methods to evaluate the properties of our time series, including stationarity, autocorrelation, and heteroscedasticity. These analyses will help us determine the most suitable approach for further modeling and forecasting.
As can be seen in the time series, there is a slight but noticeable linear growth trend over time. This suggests that the series is not completely stationary and that differencing is required to remove the trend component. To address this, we set d=1, ensuring that our ARIMA model captures changes rather than absolute levels, making it more suitable for forecasting future values.
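The effect of setting d = 1 can be checked on a toy trending series (synthetic values, not the match data): differencing turns a series that grows in mean into one that fluctuates around a constant.

```r
# A linear trend plus noise is non-stationary in mean;
# taking the first difference (d = 1) removes the trend component
set.seed(1)
x  <- 100 + 0.5 * (1:200) + rnorm(200, sd = 5)  # upward-trending series
dx <- diff(x)                                    # first difference
mean(dx)  # hovers around the slope (0.5) instead of growing over time
```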
# Convert tourney_date to Year-Month format and aggregate by month
mean_duration_tsibble <- mean_duration_clean %>%
mutate(tourney_date = yearmonth(tourney_date)) %>%
group_by(tourney_date) %>%
summarise(mean_minutes = mean(mean_minutes, na.rm = TRUE)) %>%
as_tsibble(index = tourney_date) %>%
fill_gaps() # Fill missing months with explicit NAs
# Ensure mean_minutes is numeric
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(mean_minutes = as.numeric(mean_minutes))
# Compute ACF & PACF with differenced series
mean_duration_tsibble %>%
mutate(diff_minutes = difference(mean_minutes)) %>%
gg_tsdisplay(diff_minutes, plot_type = "partial", lag = 36) +
ggtitle("ACF & PACF - First Difference of Match Duration (Lag=36)")
Observing the ACF and partial ACF plots after taking the first difference, we can clearly see that there is a correlation with past values. Specifically, we focus on the significant lag at 12 months (1 year), which corresponds to the same tournament played in the previous year. This suggests a strong dependence due to similar conditions occurring at the same time each year.
Additionally, shorter-term dependencies, particularly those at 1 or 2 months, can be explained by the seasonal nature of the tennis calendar. Tournaments with similar characteristics are usually scheduled around the same period each year, such as the clay-court season, the grass-court season, and the North American hard-court swing. These structured tours lead to the observed dependencies in the time series.
In this section, we will perform a time series decomposition to analyze the underlying components of match duration over time. By breaking down the series into trend, seasonality, and residual components, we aim to better understand its structure and identify potential patterns. This step will help us assess the influence of recurring events, such as Grand Slam tournaments, and guide further modeling decisions.
mean_duration_tsibble <- mean_duration_clean %>%
as_tsibble(index = tourney_date) %>%
fill_gaps() %>%
mutate(mean_minutes = ifelse(is.na(mean_minutes), mean(mean_minutes, na.rm = TRUE), mean_minutes))
print(interval(mean_duration_tsibble))
## <interval[1]>
## [1] 1D
dcmp <- mean_duration_tsibble |>
model(stl = STL(mean_minutes ~ season(window = 4)))
components(dcmp) |> autoplot() +
labs(title = "STL Decomposition of Match Durations",
x = "Date", y = "Match Duration (Minutes)")
The decomposition results reveal a clear and easily identifiable trend, indicating that the average match duration has been increasing over the years. However, the seasonal component presents some challenges, as the dates of the tournaments do not always align exactly with those of previous years. This misalignment introduces errors in the seasonal pattern, making it less reliable.
To address this issue, we would need to use more advanced models capable of handling event-based seasonality more effectively. However, since this goes beyond the scope of our current study, we will proceed with this decomposition for analysis purposes, acknowledging that it is not the most precise approach.
Explanation
# Define the year to analyze
selected_year <- 2023
# Filter the data for the selected year only
mean_duration_clean_year <- mean_duration_clean %>%
mutate(tourney_date = as.Date(tourney_date)) %>% # Convert to Date format
filter(year(tourney_date) == selected_year) # Keep only data from the selected year
# Convert to tsibble and fill missing dates
mean_duration_tsibble <- mean_duration_clean_year %>%
as_tsibble(index = tourney_date) %>%
fill_gaps() %>% # Fill implicit gaps in the time series
mutate(mean_minutes = as.numeric(mean_minutes)) %>% # Ensure numeric format
mutate(mean_minutes = ifelse(is.na(mean_minutes), mean(mean_minutes, na.rm = TRUE), mean_minutes)) # Replace missing values with the mean
# Mark the Grand Slam tournament weeks as binary (dummy) variables
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(
AusOpen = ifelse(between(tourney_date, as.Date(paste0(selected_year, "-01-16")), as.Date(paste0(selected_year, "-01-29"))), 1, 0),
RolandGarros = ifelse(between(tourney_date, as.Date(paste0(selected_year, "-05-22")), as.Date(paste0(selected_year, "-06-04"))), 1, 0),
Wimbledon = ifelse(between(tourney_date, as.Date(paste0(selected_year, "-07-03")), as.Date(paste0(selected_year, "-07-16"))), 1, 0),
USOpen = ifelse(between(tourney_date, as.Date(paste0(selected_year, "-08-28")), as.Date(paste0(selected_year, "-09-10"))), 1, 0)
)
# Plot match durations with Grand Slam weeks highlighted
ggplot(mean_duration_tsibble, aes(x = tourney_date, y = log(mean_minutes))) +
geom_line(color = "gray") + # General trend in gray
geom_point(data = mean_duration_tsibble %>% filter(AusOpen == 1), aes(y = log(mean_minutes)), color = "blue", size = 2) + # Australian Open in blue
geom_point(data = mean_duration_tsibble %>% filter(RolandGarros == 1), aes(y = log(mean_minutes)), color = "red", size = 2) + # Roland Garros in red
geom_point(data = mean_duration_tsibble %>% filter(Wimbledon == 1), aes(y = log(mean_minutes)), color = "green", size = 2) + # Wimbledon in green
geom_point(data = mean_duration_tsibble %>% filter(USOpen == 1), aes(y = log(mean_minutes)), color = "purple", size = 2) + # US Open in purple
labs(title = paste("Match Durations During Grand Slam Weeks -", selected_year),
x = "Date",
y = "Log Match Duration (Minutes)") +
theme_minimal()

In the previous graph, we can observe that the model correctly captures the first two and last two weeks of the seasonal pattern. However, there is a clear error in the middle of the period, where the model incorrectly assigns seasonality to a week prior to the actual pattern. This misalignment results in an incorrect seasonal adjustment, causing higher residuals.
This shift in seasonality is the main reason for the increased residual variance. Unfortunately, at this stage, we cannot directly correct this issue with our current approach. More advanced models would be required to better capture event-based seasonality, but for now, we will proceed with the current decomposition despite its limitations.
In this section, we will perform model selection for ARIMA, focusing on incorporating seasonality and trend into our forecasts. As previously discussed, one of the main challenges we face is that our initial model does not fully capture the seasonal patterns present in the data. To address this, we will test different SARIMA configurations with various parameter combinations to determine whether we can better identify the underlying patterns.
By systematically experimenting with different seasonal and non-seasonal parameters, we aim to refine our model selection process and improve predictive performance. Our goal is to find the most accurate and reliable model that effectively captures both short-term dependencies and long-term seasonal structures in our time series.
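Rather than fitting each candidate in a separate chunk, this comparison can be sketched in a single model() call, assuming the weekly training set (train_duration) built below; glance() then reports the information criteria for each specification:

```r
# Sketch: fit several candidate (S)ARIMA specifications side by side
# and rank them by AICc (lower indicates a better fit/complexity trade-off)
candidates <- train_duration %>%
  model(
    arima_111_s7  = ARIMA(mean_minutes ~ pdq(1, 1, 1) + PDQ(0, 1, 1, 7)),
    arima_011_s52 = ARIMA(mean_minutes ~ pdq(0, 1, 1) + PDQ(0, 1, 1, 52)),
    arima_auto    = ARIMA(mean_minutes)
  )
glance(candidates) %>% arrange(AICc)
```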
# Convert data into a tsibble and handle missing values
mean_duration_tsibble <- mean_duration_clean %>%
as_tsibble(index = tourney_date) %>%
fill_gaps() %>%
mutate(mean_minutes = ifelse(is.na(mean_minutes), mean(mean_minutes, na.rm = TRUE), mean_minutes))
# Perform STL decomposition
dcmp <- mean_duration_tsibble |>
model(stl = STL(mean_minutes ~ season(window = 4)))
# Extract remainder component
remainder_component <- components(dcmp) |> dplyr::select(tourney_date, remainder)
# Plot the STL decomposition
stl_plot <- autoplot(components(dcmp)) +
labs(title = "STL Decomposition of Match Durations", x = "Date", y = "Match Duration (Minutes)")
# Compute and plot ACF of the remainder component
acf_plot <- remainder_component |> ACF(remainder) |> autoplot() +
labs(title = "ACF of the Remainder Component")
# Compute and plot PACF of the remainder component
pacf_plot <- remainder_component |> PACF(remainder) |> autoplot() +
labs(title = "PACF of the Remainder Component")
# Arrange the plots in a grid (STL on top, ACF & PACF side by side below)
final_plot <- stl_plot / (acf_plot | pacf_plot)
# Display the final plot
print(final_plot)

The STL decomposition provides insights into the underlying patterns of match durations. It reveals a long-term increasing trend, as well as two seasonal components labeled season_year and season_week. However, due to the factors previously discussed, we recognize that the model does not perfectly capture seasonality. Despite this limitation, we will treat the results as valid while acknowledging potential inaccuracies.
The trend component shows a slight upward movement over time, while the weekly seasonality appears to be more pronounced than the yearly pattern. The remainder component seems to be stationary, which is a positive indication for fitting an ARIMA model.
Looking at the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots, we can identify key characteristics for model selection. The ACF plot shows strong correlation at multiples of 7 days, indicating a weekly cycle in the data. The slow decay of the autocorrelations implies that the series is not completely stationary, suggesting the need for at least one order of differencing. Meanwhile, the PACF plot shows a sharp cutoff after the first lag, which is typically a sign that an AR(1) or AR(2) process might be suitable.
Given these observations and the assumption that our seasonal decomposition is approximately correct, we propose the following approach for ARIMA modeling. First, we will apply one seasonal difference with a lag of 7 days to address the weekly pattern. Additionally, we may apply a regular difference if the series remains non-stationary. Based on the ACF and PACF behaviors, an ARIMA(1,1,1) model with seasonal differencing (lag = 7 days) seems like a reasonable starting point.
# Step 1: Convert the data into a weekly tsibble
mean_duration_tsibble <- mean_duration_clean2 %>%
mutate(tourney_week = yearweek(tourney_date)) %>% # Convert date to weekly format
group_by(tourney_week) %>%
summarise(mean_minutes = mean(mean_minutes, na.rm = TRUE)) %>%
as_tsibble(index = tourney_week)
# Step 2: Interpolate missing values using linear interpolation
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(mean_minutes = zoo::na.approx(mean_minutes, na.rm = FALSE))
# Step 3: Ensure all time points are present (fill any remaining gaps)
mean_duration_tsibble <- mean_duration_tsibble %>%
fill_gaps()
# Step 4: Filter data until 2023 for training
train_duration <- mean_duration_tsibble %>%
filter(year(tourney_week) <= 2023)
# Step 5: Fit the ARIMA model with seasonal differencing (weekly lag)
fit_duration <- train_duration %>%
model(ARIMA(mean_minutes ~ pdq(1,1,1) + PDQ(0,1,1,7)))
# Step 6: Generate future dates correctly
last_week <- max(mean_duration_tsibble$tourney_week) # Last week in the dataset
future_dates <- tibble(
tourney_week = seq(last_week + 1, by = 1, length.out = 104) # Generate 104 future weeks
) %>%
as_tsibble(index = tourney_week)
# Step 7: Generate forecast with confidence intervals
forecast_duration <- fit_duration %>% forecast(new_data = future_dates)
# Step 8: Plot actual data vs. predictions
forecast_duration %>%
autoplot(mean_duration_tsibble) +
labs(title = "Forecast vs Real Data for 'Mean Match Duration (Weekly)'",
y = "Match Duration (Minutes)", x = "Week") +
theme_minimal()

Observations
As expected, the prediction results are not satisfactory. Our model struggles to capture the complex patterns present in the data, leading to poor forecasting performance. Despite incorporating a seasonal ARIMA structure, the model fails to fully account for the underlying seasonality and intricate variations in match duration. This confirms our initial assumption that the time series contains patterns that our approach is unable to detect effectively. Consequently, while the forecast follows the general trend, it lacks precision and does not accurately reflect the real fluctuations in the data.
Since our previous model failed to capture the underlying seasonality, we will now attempt to define it manually. Instead of relying on automated decomposition, we will explicitly introduce seasonal effects based on domain knowledge. In tennis, match durations tend to follow specific patterns, particularly around the four Grand Slam tournaments (Australian Open, French Open, Wimbledon, and US Open). These events occur at fixed periods within the year and significantly influence match lengths due to longer formats and higher competition intensity. By incorporating these seasonal factors manually, we aim to improve the predictive performance of our model.
We will start with an ARIMA(0,1,1) + Grand Slam model. Initially, we assume that tournaments are independent of each other, meaning match durations do not directly depend on past values. The first difference (d=1) helps remove potential trends, while the moving average component (MA(1)) captures short-term dependencies or sudden changes in match durations.
However, we manually introduce seasonality by incorporating the effect of Grand Slam tournaments. These events occur annually around the same dates and under consistent conditions, potentially influencing match durations. By explicitly including Grand Slams as a factor, we ensure that our model accounts for the recurring impact of these major tournaments, rather than relying solely on traditional seasonal components within the ARIMA framework.
# Step 1: Convert data into a tsibble with weekly aggregation
mean_duration_tsibble <- mean_duration_clean2 %>%
mutate(tourney_week = yearweek(tourney_date)) %>% # Convert to weekly format
group_by(tourney_week) %>%
summarise(mean_minutes = mean(mean_minutes, na.rm = TRUE)) %>%
as_tsibble(index = tourney_week) %>%
fill_gaps() # Fill missing weeks
# Ensure no NA values in mean_minutes
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(mean_minutes = ifelse(is.na(mean_minutes), mean(mean_minutes, na.rm = TRUE), mean_minutes))
# Step 2: Define Grand Slam weeks (approximate based on historical schedules)
grand_slam_weeks <- c(
yearweek("2023-01-15"), # Australian Open (mid-January)
yearweek("2023-05-28"), # French Open (late May - early June)
yearweek("2023-07-02"), # Wimbledon (early July)
yearweek("2023-08-27") # US Open (late August - early September)
)
# Step 3: Create a binary variable for Grand Slam periods
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(grand_slam = if_else(tourney_week %in% grand_slam_weeks, 1, 0))
# Step 4: Train ARIMA model incorporating the Grand Slam effect
train_duration <- mean_duration_tsibble %>%
filter(year(tourney_week) <= 2023)
fit_duration <- train_duration %>%
model(ARIMA(mean_minutes ~ pdq(0,1,1) + grand_slam))
# Step 5: Generate future dates correctly
last_week <- max(mean_duration_tsibble$tourney_week)
future_dates <- tibble(
tourney_week = seq(last_week + 1, by = 1, length.out = 104)
) %>%
as_tsibble(index = tourney_week) %>%
mutate(grand_slam = if_else(tourney_week %in% grand_slam_weeks, 1, 0))
# Step 6: Generate predictions with confidence intervals
forecast_duration <- fit_duration %>% forecast(new_data = future_dates)
# Step 7: Plot actual vs predicted data
forecast_duration %>%
autoplot(mean_duration_tsibble) +
labs(title = "Forecast vs Real Data with Manually Defined Seasonality",
y = "Match Duration (Minutes)", x = "Week") +
theme_minimal()

The prediction results have exceeded our expectations, showing a significant improvement over our initial model. By manually specifying the seasonality associated with the four Grand Slam tournaments, we have successfully captured some of the recurring patterns in match durations that were previously overlooked. This adjustment has led to a more accurate forecast, as the model now accounts for the known periods of increased match length due to major tournaments.
Although the ACF and PACF plots do not explicitly indicate an annual dependence, our initial analysis suggests that there is indeed a yearly pattern. This is because tournaments are typically held during the same weeks each year, meaning that external conditions such as weather, player schedules, and tournament structures remain similar from one year to the next. As a result, the match durations from the same period in previous years can be used as a reliable reference. For this reason, we incorporate a seasonal period of 52 weeks into our model to account for this yearly recurrence.
This is an ARIMA(0,1,1)(0,1,1)_52 model. It builds on the assumption that match durations follow a non-stationary process, requiring a first difference (d=1) to remove trends, and includes a moving average component (MA(1)) to capture short-term dependencies.
The (0,1,1)_52 seasonal component explicitly models yearly seasonality, as tennis tournaments often follow a structured calendar: a seasonal difference (D=1) removes the year-over-year level, and a seasonal MA(1) term captures the remaining annual dependence. By using a period of 52 weeks, we account for the recurrence of similar tournaments at the same time each year.
# Step 1: Convert data into a weekly tsibble
mean_duration_tsibble <- mean_duration_clean2 %>%
mutate(tourney_week = yearweek(tourney_date)) %>% # Convert date to weekly format
group_by(tourney_week) %>%
summarise(mean_minutes = mean(mean_minutes, na.rm = TRUE)) %>%
as_tsibble(index = tourney_week)
# Step 2: Interpolate missing values between previous and next using linear interpolation
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(mean_minutes = zoo::na.approx(mean_minutes, na.rm = FALSE))
# Step 3: Ensure all time points are present (fill any remaining gaps)
mean_duration_tsibble <- mean_duration_tsibble %>%
fill_gaps()
# Step 4: Filter data up to 2023 for training
train_duration <- mean_duration_tsibble %>%
filter(year(tourney_week) <= 2023)
# Step 5: Fit the SARIMA model with seasonality
fit_duration_seasonal <- train_duration %>%
model(ARIMA(mean_minutes ~ pdq(0,1,1) + PDQ(0,1,1,52)))
# Step 6: Generate future dates correctly
last_week <- max(mean_duration_tsibble$tourney_week) # Get the last available week
future_dates <- tibble(
tourney_week = seq(last_week + 1, by = 1, length.out = 104) # Create 104 future weeks
) %>%
as_tsibble(index = tourney_week)
# Step 7: Generate the forecast with confidence intervals
forecast_duration_seasonal <- fit_duration_seasonal %>% forecast(new_data = future_dates)
# Step 8: Plot actual data vs. forecast
forecast_duration_seasonal %>%
autoplot(mean_duration_tsibble) +
labs(title = "Forecast vs Real Data for 'Mean Match Duration (Weekly)' (SARIMA Model)",
y = "Match Duration (Minutes)", x = "Week") +
theme_minimal()

The annual dependence yields good results, confirming our hypothesis that tournaments played in the same weeks each year share similar conditions. This insight was made possible thanks to the initial data analysis conducted at the beginning of the study, as well as my personal experience watching tennis. The consistency in scheduling means that factors such as weather, court conditions, and player performance trends remain comparable across years, making the incorporation of a 52-week seasonality a logical and effective choice for improving the model’s accuracy.
The final model we will consider is the one automatically generated by the ARIMA function. This approach allows the model to determine the optimal parameters based on the data without manually specifying orders or seasonal components. By letting ARIMA select the best configuration, we can compare its performance against the models we previously designed, assessing whether the automatic approach captures the underlying patterns more effectively or if the manually fine-tuned models provide better results.
# Step 1: Convert data into a tsibble with weekly aggregation
mean_duration_tsibble <- mean_duration_clean2 %>%
mutate(tourney_week = yearweek(tourney_date)) %>% # Convert date to weekly format
group_by(tourney_week) %>%
summarise(mean_minutes = mean(mean_minutes, na.rm = TRUE)) %>% # Compute weekly average
as_tsibble(index = tourney_week) %>%
fill_gaps() # Fill missing weeks
# Ensure there are no NA values in mean_minutes
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(mean_minutes = ifelse(is.na(mean_minutes), mean(mean_minutes, na.rm = TRUE), mean_minutes))
# Step 2: Fit an ARIMA model to all available data
fit_duration <- mean_duration_tsibble %>%
model(ARIMA(mean_minutes)) # Train ARIMA model
# Step 3: Generate future dates starting from the last available week
last_week <- max(mean_duration_tsibble$tourney_week) # Get last recorded week
future_dates <- tibble(
tourney_week = seq(last_week + 1, by = 1, length.out = 104) # Generate 104 weeks into the future
) %>%
as_tsibble(index = tourney_week)
# Step 4: Generate forecasts
forecast_duration <- fit_duration %>% forecast(new_data = future_dates)
# Step 5: Plot the forecast
forecast_duration %>%
autoplot(mean_duration_tsibble) +
labs(title = "Forecast for 'Mean Match Duration (Weekly)'",
y = "Match Duration (Minutes)", x = "Week") +
theme_minimal()

## Series: mean_minutes
## Model: ARIMA(2,0,1)(0,1,1)[52] w/ drift
##
## Coefficients:
## ar1 ar2 ma1 sma1 constant
## -0.1574 -0.2134 0.3464 -0.4199 0.4434
## s.e. 0.0907 0.0315 0.0930 0.0256 0.2613
##
## sigma^2 estimated as 177: log likelihood=-6622.29
## AIC=13256.57 AICc=13256.63 BIC=13289.03
Explanation
The AutoARIMA model selected is ARIMA(2,0,1)(0,1,1)[52] with drift, meaning it captures both short-term and seasonal patterns. The AR(2) and MA(1) components suggest that match duration depends on the previous two weeks and a moving average effect. The seasonal differencing (D=1) and MA(1) at a 52-week period indicate an annual pattern, aligning with our understanding that tournaments occur around the same time each year. The drift (0.4434) suggests a slight upward trend in match duration over time. These parameters were chosen automatically based on AIC, optimizing the model’s balance between fit and complexity.
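As a quick arithmetic check on the reported criteria: with k = 6 estimated parameters (ar1, ar2, ma1, sma1, the constant, and sigma^2), the definition AIC = -2 log L + 2k reproduces the summary value up to rounding:

```r
# AIC = -2 * logLik + 2 * k, using the values from the model summary above
loglik <- -6622.29
k <- 6                      # ar1, ar2, ma1, sma1, constant, sigma^2
-2 * loglik + 2 * k         # 13256.58, matching the reported AIC of 13256.57
```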
In this section, we will evaluate the different models we have proposed by comparing their predictions for 2023 with the actual values. This comparison will help us determine which model best captures the underlying patterns in match duration and assess the accuracy of our forecasting approaches. By analyzing the discrepancies between predicted and observed values, we can identify strengths and limitations in each model and refine our approach accordingly.
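For reference, the two accuracy measures used in this evaluation can be written as small helper functions (a minimal base-R sketch; the chunks that follow compute the same quantities inline instead):

```r
# RMSE penalizes large errors quadratically, in the units of the series;
# MAPE expresses the error as a percentage of the actual value
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2, na.rm = TRUE))
mape <- function(actual, predicted) mean(abs((predicted - actual) / actual), na.rm = TRUE) * 100

rmse(c(100, 110), c(104, 107))  # sqrt((16 + 9) / 2) ~ 3.54 minutes
mape(c(100, 110), c(104, 107))  # (4/100 + 3/110) / 2 * 100 ~ 3.36 %
```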
# Step 1: Convert the data into a weekly tsibble
mean_duration_tsibble <- mean_duration_clean2 %>%
mutate(tourney_week = yearweek(tourney_date)) %>% # Convert date to weekly format
group_by(tourney_week) %>%
summarise(mean_minutes = mean(mean_minutes, na.rm = TRUE)) %>%
as_tsibble(index = tourney_week)
# Step 2: Interpolate missing values using linear interpolation
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(mean_minutes = zoo::na.approx(mean_minutes, na.rm = FALSE))
# Step 3: Ensure all time points are present (fill any remaining gaps)
mean_duration_tsibble <- mean_duration_tsibble %>%
fill_gaps()
# Step 4: Split data into training (until 2022) and actual 2023 values
train_duration <- mean_duration_tsibble %>%
filter(year(tourney_week) <= 2022)
test_duration <- mean_duration_tsibble %>%
filter(year(tourney_week) == 2023 & tourney_week <= yearweek("2023 W34"))
# Step 5: Fit the ARIMA model with seasonal differencing (weekly lag)
fit_duration <- train_duration %>%
model(ARIMA(mean_minutes ~ pdq(1,1,1) + PDQ(0,1,1,7)))
# Step 6: Generate forecasts for 2023
forecast_duration <- fit_duration %>%
forecast(h = nrow(test_duration)) # Forecast only for available 2023 weeks
# Step 7: Compute RMSE and MAPE comparing forecast vs actual values
results <- forecast_duration %>%
filter(tourney_week %in% test_duration$tourney_week) %>%
left_join(test_duration, by = "tourney_week") %>%
mutate(error = .mean - mean_minutes.y,
abs_percentage_error = abs(error / mean_minutes.y) * 100)
rmse <- sqrt(mean(results$error^2, na.rm = TRUE))
mape <- mean(results$abs_percentage_error, na.rm = TRUE)
print(paste("RMSE of the model:", round(rmse, 2)))
print(paste("MAPE of the model:", round(mape, 2), "%"))
# Step 8: Plot actual data vs. predictions
forecast_duration %>%
autoplot(mean_duration_tsibble) +
labs(title = "Forecast vs Real Data for 'Mean Match Duration (Weekly)'",
y = "Match Duration (Minutes)", x = "Week") +
theme_minimal()

## [1] "RMSE of the model: 35.44"
## [1] "MAPE of the model: 12.3 %"
As expected, this first model performs poorly in predicting match durations. Its inability to account for the underlying seasonality patterns significantly impacts its accuracy, leading to a high prediction error. This confirms our initial hypothesis that a more sophisticated approach, incorporating seasonal dependencies, is necessary to improve forecasting performance.
# Step 1: Convert the dataset into a tsibble with weekly aggregation
mean_duration_tsibble <- mean_duration_clean2 %>%
mutate(tourney_week = yearweek(tourney_date)) %>%
group_by(tourney_week) %>%
summarise(mean_minutes = mean(mean_minutes, na.rm = TRUE)) %>%
as_tsibble(index = tourney_week) %>%
fill_gaps()
# Ensure no NA values in mean_minutes
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(mean_minutes = ifelse(is.na(mean_minutes), mean(mean_minutes, na.rm = TRUE), mean_minutes))
# Step 2: Define Grand Slam weeks (up to 2022)
grand_slam_weeks <- c(
yearweek("2022-01-16"), # Australian Open (mid-January)
yearweek("2022-05-29"), # French Open (late May - early June)
yearweek("2022-07-03"), # Wimbledon (early July)
yearweek("2022-08-28"), # US Open (late August - early September)
yearweek("2021-01-17"),
yearweek("2021-05-30"),
yearweek("2021-07-04"),
yearweek("2021-08-29"),
yearweek("2020-01-19"),
yearweek("2020-09-27"),
yearweek("2020-07-05"),
yearweek("2020-09-13")
)
# Step 3: Create a binary variable for Grand Slam periods
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(grand_slam = if_else(tourney_week %in% grand_slam_weeks, 1, 0))
# Step 4: Filter data until 2022 for training
train_duration <- mean_duration_tsibble %>%
filter(year(tourney_week) <= 2022)
# Step 5: Train the ARIMA model incorporating Grand Slam effect
fit_duration <- train_duration %>%
model(ARIMA(mean_minutes ~ pdq(0,1,1) + grand_slam))
# Step 6: Generate future dates for 2023 until week 34
first_forecast_week <- max(train_duration$tourney_week) + 1 # Ensure forecast starts right after training data
future_dates <- tibble(
tourney_week = seq(first_forecast_week, by = 1, length.out = 34)
) %>%
as_tsibble(index = tourney_week) %>%
mutate(grand_slam = if_else(tourney_week %in% grand_slam_weeks, 1, 0))
# Step 7: Generate the forecast for the first 34 weeks of 2023
forecast_2023 <- fit_duration %>% forecast(new_data = future_dates)
# Step 8: Get actual values from 2023 up to week 34
actual_2023 <- mean_duration_tsibble %>%
filter(year(tourney_week) == 2023, week(tourney_week) <= 34)
# Step 9: Merge predictions and actual values
comparison_2023 <- forecast_2023 %>%
left_join(actual_2023, by = "tourney_week") %>%
mutate(error = .mean - mean_minutes.y,
abs_percentage_error = abs(error / mean_minutes.y) * 100)
# Step 10: Compute RMSE and MAPE
rmse_2023 <- sqrt(mean((comparison_2023$error)^2, na.rm = TRUE))
mape_2023 <- mean(comparison_2023$abs_percentage_error, na.rm = TRUE)
# Print RMSE and MAPE
print(paste("RMSE:", round(rmse_2023, 2)))
print(paste("MAPE:", round(mape_2023, 2), "%"))
## [1] "RMSE: 28.7"
## [1] "MAPE: 10.59 %"
# Step 11: Plot forecast vs actual data for 2023 (up to week 34)
forecast_2023 %>%
autoplot(mean_duration_tsibble) +
labs(title = "Forecast vs Real Data (2023 W1 - W34)",
y = "Match Duration (Minutes)", x = "Week") +
theme_minimal()

This second model represents an improvement over the first one, as the predictions visually align better with the actual data. However, the RMSE has not decreased drastically: in the first model it was around 35, while now it is approximately 28.7.
This is because the model fails to accurately identify when the next Grand Slam will be played, leading to significant prediction errors. For example, Roland Garros takes place in weeks 22 and 23, but the model assumes it happens in weeks 21 and 22, which increases the overall prediction error.
# Select weeks 20 to 30 from actual and predicted values with clearer column names
comparison_weeks <- comparison_2023 %>%
filter(tourney_week >= yearweek("2023 W20") & tourney_week <= yearweek("2023 W30")) %>%
dplyr::select(tourney_week, actual_duration = mean_minutes.y, predicted_duration = .mean)
# Print table
print(comparison_weeks)

## # A tsibble: 11 x 3 [1W]
## tourney_week actual_duration predicted_duration
## <week> <dbl> <dbl>
## 1 2023 W20 106. 109.
## 2 2023 W21 101. 162.
## 3 2023 W22 225 163.
## 4 2023 W23 225 113.
## 5 2023 W24 102. 103.
## 6 2023 W25 103. 105.
## 7 2023 W26 109. 164.
## 8 2023 W27 158. 150.
## 9 2023 W28 161. 108.
## 10 2023 W29 106. 111.
## 11 2023 W30 108. 114.
As we can see in the table, the model assumes that Roland Garros is played in weeks 21 and 22, while in reality, it takes place in weeks 22 and 23. This misalignment causes the model to anticipate the increase in match duration one week earlier than it should and fail to capture the subsequent decline correctly, leading to a significant increase in prediction error.
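A possible remedy, which we note here but do not pursue in this study, would be to widen each Grand Slam dummy so that a one-week scheduling shift still falls inside the flagged window (a hypothetical adjustment; grand_slam_wide is a new illustrative variable, not used below):

```r
# Sketch: extend the grand_slam dummy to the adjacent weeks, so that a
# tournament starting one week earlier or later is still covered
mean_duration_tsibble <- mean_duration_tsibble %>%
  mutate(grand_slam_wide = pmin(1, grand_slam +
                                   dplyr::lag(grand_slam, default = 0) +
                                   dplyr::lead(grand_slam, default = 0)))
```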
# Step 1: Convert data into a tsibble with weekly aggregation
mean_duration_tsibble <- mean_duration_clean2 %>%
mutate(tourney_week = yearweek(tourney_date)) %>%
group_by(tourney_week) %>%
summarise(mean_minutes = mean(mean_minutes, na.rm = TRUE)) %>%
as_tsibble(index = tourney_week) %>%
fill_gaps()
# Ensure there are no NA values in mean_minutes
mean_duration_tsibble <- mean_duration_tsibble %>%
mutate(mean_minutes = ifelse(is.na(mean_minutes), mean(mean_minutes, na.rm = TRUE), mean_minutes))
# Step 2: Filter data up to 2022 for training
train_duration <- mean_duration_tsibble %>%
filter(year(tourney_week) <= 2022) # Train only with data until 2022
# Step 3: Fit an ARIMA model
fit_duration <- train_duration %>%
model(ARIMA(mean_minutes)) # Train ARIMA model
# Step 4: Generate future dates for 2023 (up to W34)
future_dates <- tibble(
tourney_week = seq(yearweek("2023 W01"), yearweek("2023 W34"), by = 1) # Forecast weeks for 2023
) %>%
as_tsibble(index = tourney_week)
# Step 5: Generate forecasts for 2023
forecast_duration <- fit_duration %>% forecast(new_data = future_dates)
# Step 6: Calculate RMSE and MAPE comparing actual vs predicted values
comparison_2023 <- mean_duration_tsibble %>%
filter(year(tourney_week) == 2023 & tourney_week <= yearweek("2023 W34")) %>%
left_join(forecast_duration, by = "tourney_week") %>%
mutate(error = mean_minutes.x - .mean,
abs_percentage_error = abs(error / mean_minutes.x) * 100) # Compute absolute percentage error
# Compute RMSE and MAPE
rmse_2023 <- sqrt(mean(comparison_2023$error^2, na.rm = TRUE))
mape_2023 <- mean(comparison_2023$abs_percentage_error, na.rm = TRUE)
print(paste("RMSE for 2023 up to W34:", round(rmse_2023, 2)))
print(paste("MAPE for 2023 up to W34:", round(mape_2023, 2), "%"))
# Step 7: Plot actual vs predicted values
forecast_duration %>%
autoplot(mean_duration_tsibble) +
labs(title = "Forecast vs Real Data (Trained until 2022, Predicted 2023)",
y = "Match Duration (Minutes)", x = "Week") +
theme_minimal()

## [1] "RMSE for 2023 up to W34: 27.46"
## [1] "MAPE for 2023 up to W34: 8.83 %"
# Select weeks 20 to 30 from actual and predicted values with clearer column names
comparison_weeks <- comparison_2023 %>%
filter(tourney_week >= yearweek("2023 W20") & tourney_week <= yearweek("2023 W30")) %>%
dplyr::select(tourney_week, actual_duration = mean_minutes.x, predicted_duration = .mean)
# Print table
print(comparison_weeks)

## # A tsibble: 11 x 3 [1W]
## tourney_week actual_duration predicted_duration
## <week> <dbl> <dbl>
## 1 2023 W20 106. 107.
## 2 2023 W21 101. 145.
## 3 2023 W22 225 160.
## 4 2023 W23 225 111.
## 5 2023 W24 102. 101.
## 6 2023 W25 103. 103.
## 7 2023 W26 109. 147.
## 8 2023 W27 158. 147.
## 9 2023 W28 161. 106.
## 10 2023 W29 106. 109.
## 11 2023 W30 108. 112.
Just like before, we improved our model by using autoARIMA, which helped optimize the parameters and enhance the overall prediction accuracy. However, a significant issue remains: our model does not accurately identify the exact timing of Grand Slam tournaments. Since these events do not always occur on the same fixed weeks each year and instead fluctuate slightly over time, our model struggles to capture their impact on match durations correctly.
As a result, the predicted durations show noticeable discrepancies around these key events, leading to a relatively high error. This misalignment contributes to a considerable RMSE, as our forecasts either anticipate the Grand Slam effects too early or too late, affecting the overall accuracy of our model.
At the beginning of this study, we successfully identified the seasonality patterns driven by the timing of the different Grand Slam tournaments. However, our models struggled to fully capture these patterns due to their slight fluctuations over time. This misalignment led to noticeable prediction errors, as the models were unable to precisely anticipate the weeks in which these key events would take place.
Despite this challenge, we can confidently conclude that there is a clear upward trend in match duration, aligning with our initial hypothesis. Our models effectively capture this long-term growth, allowing us to predict the general increase in match duration over time. Nevertheless, these predictions come with inherent errors caused by the inability to fully model the seasonality shifts of Grand Slam events.
# Convert the dataset into a tsibble and handle missing values
mean_duration_tsibble <- mean_duration_clean %>%
as_tsibble(index = tourney_date) %>%
fill_gaps() %>%
mutate(mean_minutes = ifelse(is.na(mean_minutes), mean(mean_minutes, na.rm = TRUE), mean_minutes))
# Perform STL decomposition
dcmp <- mean_duration_tsibble |>
model(stl = STL(mean_minutes ~ season(window = 4))) |>
components() # Extract decomposition components
# Extract only the trend component
trend_component <- dcmp %>%
dplyr::select(tourney_date, trend) # Explicit call to dplyr::select()
# Plot only the trend component
trend_plot <- ggplot(trend_component, aes(x = tourney_date, y = trend)) +
geom_line(color = "blue", linewidth = 1) + # linewidth replaces the deprecated size aesthetic
labs(title = "Trend Component of Match Durations",
x = "Date",
y = "Trend of Match Duration (Minutes)") +
theme_minimal()
print(trend_plot)
In conclusion, the autoARIMA model has proven to be the most accurate in predicting the average match duration, achieving the best MAPE at 8.83%, compared to the other evaluated models. This suggests that the automatic parameter selection of autoARIMA has effectively captured the temporal dynamics of the data, providing a more reliable and well-fitted estimation.